Abstract

We use a generating function approach to examine the errors on quantities related to counts in cells 
extracted from galaxy surveys. 

The measurement error, related to the finite number of sampling cells, is disentangled from the "cosmic 
error", due to the finiteness of the survey. Using the hierarchical model and assuming locally Poisson 
behavior, we identified three contributions to the cosmic error: 

• The finite volume effect is proportional to the average of the two-point correlation function over the 
whole survey. It accounts for possible fluctuations of the density field at scales larger than the sample
size. 

• The edge effect is related to the geometry of the survey. It accounts for the fact that objects near the 
boundary carry less statistical weight than those further away from it. 

• The discreteness effect is due to the fact that the underlying smooth random field is sampled with
a finite number of objects. This is the "shot noise" error.

To check the validity of our results, we measured the factorial moments of order N ≤ 4 in a large number
of small subsamples randomly extracted from a hierarchical sample realized by Rayleigh-Lévy random walks.
The measured statistical errors are in excellent agreement with our predictions. The probability distribution
of errors is increasingly skewed when the order N and/or the cell size increases. This suggests that "cosmic
errors" tend to be systematic: one is likely to underestimate the true value of the factorial moments.

Our study of the various regimes showed that the errors strongly depend on the clustering of the system, 
i.e., on the hierarchy of underlying correlations. The Gaussian approximation is valid only in the weakly 
non-linear regime, otherwise it severely underestimates the true errors. 

We study the concept of "number of statistically independent cells", (re)defined as the number of sampling
cells required to have the measurement error of the same order as the cosmic error. This number is found
to depend highly on the statistical object under study and is generally quite different from the number 
of cells needed to cover the survey volume. In light of these findings, we advocate high oversampling for 
measurements of counts in cells. 

Keywords: large scale structure of the universe - galaxies: clustering - methods: numerical -
methods: statistical 



1. Introduction 



The distribution of galaxies is generally admitted to be homogeneous at scales larger than ~ 150 
Mpc. Statistics naturally characterizes this distribution under the assumption that galaxies are a 
discrete realization of a continuous random field. This underlying smooth field is related to the 
distribution of luminous matter, thus it might not necessarily represent the total mass contained 
in the Universe. The primary purpose of quantifying structure in galaxy catalogs is to measure 
the properties of the underlying random field. Practically, however, galaxy surveys always cover a 
finite volume of space and contain a finite number of objects. Possible fluctuations of the random
field at the boundaries of the survey and at scales larger than the survey size, together with 
the Poisson fluctuations related to the discrete nature of the sample, introduce uncertainties on 
the measurements. These effects are present in any finite galaxy catalog, hence the name "cosmic 
error". This paper puts forth a quantitative analysis of the situation by calculating the theoretically 
expected errors on statistics related to counts in cells. 

Any statistic, aiming at extracting the properties of the underlying random field, measures 
deviations from a homogeneous random distribution, i.e., one without any correlation between 
galaxies. The most widely used measures are the two-point correlation function $\xi_2$ and its Fourier
transform, the power spectrum (see, e.g., Peebles 1980). The function $\xi_2$ corresponds to the excess
probability of pairs compared to random. Uncertainties on the measurement of $\xi_2$ and of the power spectrum
have been discussed by various authors (e.g., Peebles 1973; Peebles 1980; Landy & Szalay 1992; 
Hamilton 1993; Bernstein 1994; Feldman, Kaiser, & Peacock 1994; Colombi, Bouchet & Schaeffer 
1995, hereafter CBSII) and, as a result, the correlation function has become a well controlled 
tool. However, the large degree of inhomogeneity in the galaxy distribution, manifesting in large 
voids (e.g., de Lapparent et al. 1986; Kirshner et al. 1987; Geller & Huchra 1989), clusters and 
superclusters (e.g., Abell 1958; Bahcall 1988) is not fully described by this statistic, which only 
accounts for a Gaussian distribution adequately. To measure non-Gaussian features, higher order
correlation functions are needed (e.g., Peebles & Groth 1975; Groth & Peebles 1977; Fry &
Peebles 1978; Sharp, Bonometto, & Lucchin 1984). Unfortunately it is difficult to measure and
interpret the N-point correlation functions, especially when N > 5, mostly because of the large
number of parameters involved. In particular, the expected uncertainties on the measurements are
rather difficult to estimate.

As an alternative to correlation functions, counts in cells estimate the probability of finding N objects
in a cell of given size thrown at random in the survey. They depend on integrals of the M-point
correlation functions, M > 2, and thus characterize, although indirectly, the clustering of galaxies
to greater accuracy than the two-point function. Describing only the scaling of the underlying
distribution with the cell size, they are much simpler to deal with than the N-point correlation
functions. Methods based on counts in cells and related statistics, such as moments and moment
correlators, were thus applied to several galaxy catalogs (e.g., Alimi, Blanchard & Schaeffer 1990;
Maurogordato, Schaeffer & da Costa 1992; Szapudi, Szalay & Boschan 1992; Meiksin, Szapudi &
Szalay 1992; Bouchet et al. 1993; Gaztañaga 1994; Szapudi et al. 1995) and N-body simulations
(e.g., Bouchet, Schaeffer & Davis 1991; Bouchet & Hernquist 1992; Lucchin et al. 1994; Baugh,
Gaztañaga & Efstathiou 1995; Colombi, Bouchet, & Hernquist 1995). The assessment of the errors
on these measurements is even more delicate than the measurements themselves.

One possible procedure to estimate the errors consists of generating a large number of random
realizations modeling the data set and measuring the dispersion of the measurements experimentally
(see, e.g., Baugh, Gaztañaga & Efstathiou 1995). Such Monte Carlo methods are rather costly,
since to date the main technique to generate an artificial sample with realistic statistics is the costly
N-body simulation. Therefore the number of realizations used for such measurements is severely
limited by the available computer resources. The other option is a full scale analytic calculation.
So far, no real detailed analytic study of the errors on counts in cells has been done, except for 
the void probability (CBSII), and to some extent for the second order moment (which is more or 
less equivalent to the two-point correlation function, e.g., Efstathiou et al. 1990; CBSII). Our aim 
here is thus to deal in detail with the errors on quantities related to counts in cells, especially the 
factorial moments of the count probability distribution. These latter can be simply calculated from 
the counts in cells, and they estimate the moments of the underlying smooth random field in a 
consistent and unbiased manner (e.g., Szapudi & Szalay 1993a, hereafter SSI). 

As will be shown later, the measurement of the probability distribution of counts in cells in 
a galaxy catalog is burdened with errors from various possible sources (if all systematics of the 
observations are discounted): 

• The usual way of estimating probability distributions consists of building histograms from
a finite number of randomly thrown cells, which introduces measurement errors.

• The survey spans only a finite portion of the Universe, which introduces finite volume error, 
related to fluctuations of the density field at scales larger than the sample size. 

• The geometry of the survey causes edge effects, due to the fact that objects near the edge of 
the survey are given less statistical weight than objects far away. 

• The galaxies sample the underlying continuous field with a finite number of objects, which
creates discreteness errors.

The first of these contributions can in principle be eliminated by efficient computer algorithms
(e.g., Szapudi 1995). The other three are, however, present in all samples even under ideal
conditions, because they are simply due to the finiteness of the part of the Universe we have access
to. These errors can be systematic, i.e. the mean can be substantially different from the most
likely value. As will be shown later using a Rayleigh-Lévy hierarchical sample, it is more likely to
underestimate the real moments of the distribution than to overestimate them (see also Colombi,
Bouchet, & Schaeffer 1994, hereafter CBSI; Colombi, Bouchet, & Hernquist 1995).

The presentation is organized as follows: in §2, after introducing the general formalism, the 
measurement errors are evaluated. §3 describes the framework of the hierarchical model, which 
is, together with a locally Poisson approximation, used in §4 to calculate the errors on factorial 
moments of order up to four explicitly. In §5, the theoretical results are compared to error estimates 
from subsamples extracted from a Rayleigh-Lévy fractal. The study of the distribution of errors
shows that it can be strongly skewed implying that the errors are systematic. In §6, we summarize 
the results and discuss some applications, such as the validity of the Gaussian approximation and 
the concept of "number of statistically independent cells". The appendices contain mathematical 
derivations which would have interrupted the flow of the main text. 

2. General Formalism 

In this section, we present a formalism for calculating the theoretically expected errors on
estimates related to counts in cells measured in a finite galaxy catalog. Note that the following
formalism can be applied to any random distribution of points windowed by a finite box.






Let us imagine that we have a galaxy catalog of volume V, corresponding to length scale L, and
we attempt to measure the probability distribution in cells of volume v, corresponding to length
scale $\ell$. The theoretical dispersion of the measurement is partly due to throwing a finite number
of cells and partly reflects the finite nature of the data set. The former effect can in principle be
eliminated by throwing a very large (or infinite) number of cells. The latter effect corresponds to
the "cosmic error" in our data set.

Let $P_N$ denote the probability of finding N galaxies in a cell of size $\ell$. The factorial moments
(see, e.g., SSI) are defined as

$$F_k = \langle (N)_k \rangle = \sum_N (N)_k P_N, \tag{1}$$

where $(N)_k = N(N-1)\cdots(N-k+1)$ is the k-th falling factorial of N. The ensemble average $\langle\ \rangle$
can be evaluated through the probability distribution for any 1-point quantity. We introduce
the generating function of the probability distribution

$$P(x) = \sum_{N=0}^{\infty} P_N x^N \tag{2}$$

and $F(x) = P(x+1)$ (see, e.g., SSI), the exponential generating function of the factorial moments,
i.e.

$$F(x) = \sum_{k \ge 0} F_k \frac{x^k}{k!}. \tag{3}$$
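
As a small numerical illustration of equation (1), the sketch below estimates $P_N$ from a list of cell counts and evaluates the factorial moments. The function names are illustrative choices, not part of the original formalism. For a Poisson distribution of mean $\bar N$ the factorial moments are $F_k = \bar N^k$, which gives a simple check.

```python
from collections import Counter
from math import exp, factorial


def count_probabilities(counts):
    """Estimate P_N as the fraction of cells containing exactly N objects."""
    total = len(counts)
    return {n: c / total for n, c in Counter(counts).items()}


def falling_factorial(n, k):
    """(N)_k = N (N-1) ... (N-k+1); equals 1 for k = 0."""
    result = 1
    for j in range(k):
        result *= n - j
    return result


def factorial_moment(p_n, k):
    """F_k = sum_N (N)_k P_N, equation (1)."""
    return sum(falling_factorial(n, k) * p for n, p in p_n.items())


def poisson_pmf(nbar, n_max):
    """Poisson P_N truncated at n_max (used only as a test distribution)."""
    return {n: exp(-nbar) * nbar ** n / factorial(n) for n in range(n_max + 1)}


if __name__ == "__main__":
    nbar = 2.5
    p_n = poisson_pmf(nbar, 80)
    for k in range(1, 5):
        # The two printed numbers should agree: F_k = nbar^k for Poisson counts.
        print(k, factorial_moment(p_n, k), nbar ** k)
```

In practice `count_probabilities` would be fed the counts $N_i$ of the sampling cells, and `factorial_moment` applied to the resulting histogram.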

Similarly to the above definitions we introduce the quantities $\tilde P_N^C$ and $\tilde F_k^C$, and the generating functions
$\tilde P^C(x)$ and $\tilde F^C(x)$, corresponding respectively to the estimates of $P_N$, $F_k$, $P(x)$ and $F(x)$ from
randomly throwing C cells in the galaxy catalog. Note that this notation refers implicitly
to a particular set of cells; another set of cells can give different results. The equation

$$\tilde F^C(x) = \tilde P^C(x+1) \tag{4}$$

still holds for a given set of cells, so it is quite easy to pass from $\tilde P^C(x)$ to $\tilde F^C(x)$. If $N_i$ denotes
the number of objects in cell i, then the following equations are true:

$$\tilde P_N^C = \frac{1}{C}\sum_{i=1}^{C} \delta(N_i = N), \qquad \tilde P^C(x) = \sum_{N \ge 0} \tilde P_N^C x^N = \frac{1}{C}\sum_{i=1}^{C} x^{N_i}, \tag{5}$$

where $\delta(N = M)$ is the Kronecker delta. It is easy to see that the ensemble average of $\tilde P^C(x)$ is
$\langle \tilde P^C(x) \rangle = P(x)$, the underlying generating function. The usual measure of error on the counts in
cells and the factorial moments is the dispersion
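
$$(\Delta \tilde P_N^C)^2 = \langle\langle (\tilde P_N^C)^2 \rangle\rangle_C - \langle\langle \tilde P_N^C \rangle\rangle_C^2, \qquad (\Delta \tilde F_k^C)^2 = \langle\langle (\tilde F_k^C)^2 \rangle\rangle_C - \langle\langle \tilde F_k^C \rangle\rangle_C^2, \tag{6}$$

where the ensemble average is taken first over the measurements, and the operator $\langle\ \rangle_C$ averages
over all possible ways of throwing C cells in the survey volume V. The error generating function
$E^{C,V}(x,y)$ can be defined in such a way that the coefficients of $(xy)^N$ in its series expansion are $(\Delta \tilde P_N^C)^2$:

$$E^{C,V}(x,y) = \sum_{N,M} \left[ \langle\langle \tilde P_N^C \tilde P_M^C \rangle\rangle_C - \langle\langle \tilde P_N^C \rangle\rangle_C \langle\langle \tilde P_M^C \rangle\rangle_C \right] x^N y^M. \tag{7}$$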

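Similarly, the exponential generating function for the errors on factorial moments is

$$E^{C,V}(x+1,y+1) = \sum_{N,M} \left[ \langle\langle \tilde F_N^C \tilde F_M^C \rangle\rangle_C - \langle\langle \tilde F_N^C \rangle\rangle_C \langle\langle \tilde F_M^C \rangle\rangle_C \right] \frac{x^N y^M}{N!\,M!}. \tag{8}$$

Straightforward calculation yields

$$E^{C,V}(x,y) = \langle\langle \tilde P^C(x) \tilde P^C(y) \rangle\rangle_C - \langle\langle \tilde P^C(x) \rangle\rangle_C \langle\langle \tilde P^C(y) \rangle\rangle_C. \tag{9}$$

Clearly $\langle\langle \tilde P^C(x) \rangle\rangle_C = P(x)$. From equation (5), we have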



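$$\begin{aligned}
\langle\langle \tilde P^C(x) \tilde P^C(y) \rangle\rangle_C
&= \left\langle\left\langle \frac{1}{C^2} \sum_{i=1}^{C} x^{N_i} \sum_{j=1}^{C} y^{N_j} \right\rangle\right\rangle_C \\
&= \frac{1}{C^2} \left\langle\left\langle \sum_{i=1}^{C} (xy)^{N_i} + \sum_{i \ne j} x^{N_i} y^{N_j} \right\rangle\right\rangle_C \\
&= \frac{P(xy)}{C} + \left(1 - \frac{1}{C}\right) \langle P^{\infty}(x) P^{\infty}(y) \rangle.
\end{aligned} \tag{10}$$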



We introduced the notation

$$\langle P^{\infty}(x) P^{\infty}(y) \rangle \equiv \frac{1}{\hat V^2} \int_{\hat V} d^D r_1\, d^D r_2\, P(x,y), \tag{11}$$

where D is the dimension of the survey, and $P(x,y)$ is the generating function of the underlying bivariate
probability distribution $P_{N,M}$ for two cells located at $r_1$ and $r_2$ with N and M particles respectively.
The integral is performed over the volume $\hat V(\ell)$ corresponding to all possible positions of cells
entirely included in the survey volume V. Therefore, we have $\hat V \simeq V$ in the case $v \ll V$, and
$\hat V \ll V$ when the cell size becomes comparable to the survey size. In the second line of equation (10),
the sum is separated into two parts, $i = j$ and $i \ne j$. The second part proves to be the Monte
Carlo realization of equation (11). In what follows, we drop the index $\infty$ from $P^{\infty}(x) = P(x)$ for
simplicity. If the volume V tends to infinity in equation (11), "statistically independent" cells will
dominate the ensemble averaging, therefore $\langle P(x)P(y) \rangle \to P(x)P(y)$, and the error generating
function is

$$E^{C,\infty}(x,y) = \frac{P(xy) - P(x)P(y)}{C}. \tag{12}$$

With the notation

$$E^{\infty,V}(x,y) = \langle P(x)P(y) \rangle - \langle P(x) \rangle \langle P(y) \rangle, \tag{13}$$

we have the equation

$$E^{C,V}(x,y) = \left(1 - \frac{1}{C}\right) E^{\infty,V}(x,y) + E^{C,\infty}(x,y). \tag{14}$$

The total error on the measurement of factorial moments can thus be (approximately) disentangled
into two parts: the "cosmic error", $E^{\infty,V}$, due to the finiteness of the sample, and the measurement
error due to using only C cells.
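
The measurement term can be made concrete with a short numerical sketch. It assumes, purely for illustration, a Poisson generating function $P(x) = \exp[\bar N (x-1)]$ and extracts the variance of the estimate of $F_1$ as the mixed derivative $\partial_x \partial_y E^{C,\infty}$ at $x = y = 1$, i.e. the coefficient of $xy$ in $E^{C,\infty}(x+1,y+1)$. The expected value is $\bar N / C$, the variance of a mean over C independent cells. Function names are hypothetical.

```python
from math import exp


def poisson_pgf(x, nbar):
    """P(x) = exp[nbar (x - 1)] for a Poisson count distribution (assumed here)."""
    return exp(nbar * (x - 1.0))


def measurement_error_gf(x, y, n_cells, pgf):
    """E^{C,infinity}(x, y) = [P(xy) - P(x) P(y)] / C, equation (12)."""
    return (pgf(x * y) - pgf(x) * pgf(y)) / n_cells


def variance_of_f1(n_cells, pgf, h=1e-4):
    """Mixed derivative d^2 E / dx dy at x = y = 1 by central differences."""
    def e(x, y):
        return measurement_error_gf(x, y, n_cells, pgf)

    return (e(1 + h, 1 + h) - e(1 + h, 1 - h)
            - e(1 - h, 1 + h) + e(1 - h, 1 - h)) / (4.0 * h * h)


if __name__ == "__main__":
    nbar, n_cells = 3.0, 100
    var = variance_of_f1(n_cells, lambda x: poisson_pgf(x, nbar))
    # Compare with the Poisson expectation nbar / C.
    print(var, nbar / n_cells)
```

Any other generating function, for instance the hierarchical form of § 3, can be passed in place of `poisson_pgf`.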

This intuitive notion was present in the literature before (see, e.g., Hamilton 1985; Politzer &
Preskill 1986; Maurogordato & Lachieze-Rey 1987), and materialized in the fuzzy concept of
"number of statistically independent cells". This number, $C^*$, corresponds to the number of sampling
cells needed to extract all (actually most of) the statistically relevant information in the survey.
From equation (14), it is natural to choose $C^*$ such that the measurement error and the cosmic
error are of the same order. Note, however, that some residual information can still be extracted
using more cells, since the measurement error can be rendered arbitrarily small. Our choice of $C^*$
qualitatively matches the calculations of Politzer & Preskill (1986), at least for the void probability
distribution function (CBSII). Another simple choice often found in the literature is to cover the
sampled volume uniformly by cells, requiring $C^* \sim V/v \equiv C_V$. We shall see in § 6 that with our
more natural definition, $C^*$ depends sensitively on the statistical object under study and can be
different from $C_V$ by several orders of magnitude. Note that in any case, to reduce the errors as much
as possible, it is advisable to use many cells to measure counts in cells, so that the measurement
error is negligible compared to the cosmic error.
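
The definition of $C^*$ can be sketched in a few lines. The sketch assumes that the measurement variance scales as (single-cell variance)/C, as in equation (12), and neglects the factor $(1 - 1/C)$ of equation (14). The function name and the Poisson example are illustrative: for an unclustered sample the variance of the estimate of $F_1$ from one cell is $\bar N$ and its cosmic variance is $\bar N v/V$, so that $C^*$ reduces to $C_V = V/v$ in that special case.

```python
def independent_cell_number(single_cell_variance, cosmic_variance):
    """C* such that single_cell_variance / C* equals the cosmic variance.

    Both arguments refer to the same statistic (e.g. F_1). This is a sketch
    of the definition in the text, not a general-purpose estimator.
    """
    if cosmic_variance <= 0.0:
        raise ValueError("cosmic variance must be positive")
    return single_cell_variance / cosmic_variance


if __name__ == "__main__":
    nbar = 2.0          # mean count per cell (illustrative value)
    v_over_V = 1.0e-3   # cell volume over survey volume (illustrative value)
    # Poisson sample, statistic F_1: C* coincides with C_V = V / v.
    print(independent_cell_number(nbar, nbar * v_over_V), 1.0 / v_over_V)
```

For clustered samples and higher order moments the two variances have to be taken from the expressions derived in § 4, and $C^*$ then departs from $C_V$.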

In the following, we will evaluate $E^{\infty,V}(x,y)$, the generating function of the cosmic error.
Subsequently, by "errors" we mean cosmic errors, since the measurement errors are avoidable in principle.

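Following CBSII, the integration in equation (11) is split into two parts according to whether
the cells overlap or not:

$$\langle P(x)P(y) \rangle = \langle P(x)P(y) \rangle_{\rm overlap} + \langle P(x)P(y) \rangle_{\rm disjoint}, \tag{15}$$

with

$$\langle P(x)P(y) \rangle_{\rm overlap} \equiv \frac{1}{\hat V^2} \int_{r \le 2\ell} d^D r_1\, d^D r_2\, P(x,y), \tag{16}$$

$$\langle P(x)P(y) \rangle_{\rm disjoint} \equiv \frac{1}{\hat V^2} \int_{r > 2\ell} d^D r_1\, d^D r_2\, P(x,y), \tag{17}$$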

and $r = |r_1 - r_2|$. As shown below, the term $\langle P(x)P(y) \rangle_{\rm overlap}$ contains two contributions: the
discreteness effect brought by the sampling of the underlying smooth distribution with a finite
number of points, and the edge effect due to the fact that the statistical weight given to points decreases
toward the boundary of the catalog (e.g., Ripley 1988). The term $\langle P(x)P(y) \rangle_{\rm disjoint}$, shown later
to be proportional to the average of the correlation function over V, is due to fluctuations of the
underlying random field at scales larger than the sample size: it is related to the finite volume of
the catalog.

The calculation of $\langle P(x)P(y) \rangle_{\rm overlap}$ involves the generating function $P(x,y)$ for overlapping
cells. As proved in Appendix A, $P(x,y)$ can be obtained from the trivariate probability
generating function of the three non-overlapping cells corresponding to the two original cells as

$$P(x,y) = P(x, xy, y), \tag{18}$$

where the parameters xy, x, and y are associated with the cell formed by the overlap area, and the
rest of each cell (I, H, and J on Fig. 1 respectively).
According to the previous equations, error calculations on statistics related to counts in cells
in a finite survey involve generating functions for counts in disjoint cells up to trivariate level.
These functions can be expressed as integrals over the N-point correlation functions $\xi_N(r_1, \ldots, r_N)$
(Balian & Schaeffer 1989, hereafter BS; SSI; Szapudi & Szalay 1993b, hereafter SSII), i.e.,

$$P(x) = \exp\left\{ \sum_{N=1}^{\infty} \frac{\bar N^N (x-1)^N}{N!\, v^N} \int_v d^D r_1 \cdots d^D r_N\, \xi_N(r_1, \ldots, r_N) \right\}, \tag{19}$$

$$P(x,y) = \exp\left\{ \sum_{N,M} \frac{(x-1)^N (y-1)^M \bar N^{N+M}}{N!\,M!\, v^{N+M}} \int_{v_1} d^D r_1 \cdots d^D r_N \int_{v_2} d^D r_{N+1} \cdots d^D r_{N+M}\, \xi_{N+M}(r_1, \ldots, r_{N+M}) \right\}, \tag{20}$$

and

$$P(x,y,z) = \exp\left\{ \sum_{N,M,S} \frac{(x-1)^N (y-1)^M (z-1)^S \bar N^{N+M+S}}{N!\,M!\,S!\, v^{N+M+S}} \int_{v_1} d^D r_1 \cdots d^D r_N \int_{v_2} d^D r_{N+1} \cdots d^D r_{N+M} \int_{v_3} d^D r_{N+M+1} \cdots d^D r_{N+M+S}\, \xi_{N+M+S}(r_1, \ldots, r_{N+M+S}) \right\}. \tag{21}$$

The quantity $\bar N = \sum_N N P_N = F_1$ is the average number of objects per cell. These equations together
with equations (14), (15) and (18) solve the problem in principle: a model for the integrals of the
higher order correlation functions yields the errors on factorial moments as a set of finite, although
complicated, expressions.



3. Hierarchical Model 



In the highly nonlinear regime $\xi_2 \gg 1$, an ansatz for the structure of the N-point correlation
functions is the hierarchical model (e.g., Peebles 1980; BS)

$$\xi_N(r_1, \ldots, r_N) = \sum_{k=1}^{K(N)} Q_{Nk} \sum^{B_{Nk}} \prod^{N-1} \xi(r_{ij}), \tag{22}$$

where $\xi(r) \equiv \xi_2(r) = (r/r_0)^{-\gamma}$. Such a model seems to give a reasonably good description of the
statistics measured in the galaxy distribution (e.g., Groth & Peebles 1977; Fry & Peebles 1978;
Sharp, Bonometto, & Lucchin 1984; Szapudi, Szalay & Boschan 1992; Meiksin, Szapudi & Szalay
1992; Szapudi et al. 1995) and in N-body simulations (e.g., Bouchet, Schaeffer & Davis 1991;
Bouchet & Hernquist 1992; Fry, Melott & Shandarin 1993; Bromley 1994; Lucchin et al. 1994;
CBSI; CBSII; Colombi, Bouchet, & Hernquist 1995), particularly in the nonlinear regime.

In equation (22), the summation is over all possible $N^{N-2}$ trees with N vertices. In the sum,
every $\xi(r_{ij})$ corresponds to an edge $r_{ij} = |r_i - r_j|$ in a tree spanned by $r_1, \ldots, r_N$. For every tree,
there is a product of N − 1 two-point functions, and there is a summation over all the $B_{Nk}$ labelings of
all the K(N) distinct trees. The scale independent average $Q_N$ is defined as

$$Q_N = \frac{1}{N^{N-2}} \sum_{k=1}^{K(N)} Q_{Nk} B_{Nk} F_{Nk}, \tag{23}$$

where $F_{Nk}$ are the form factors associated with the shape of a cell of size unity (see Boschan, Szapudi
& Szalay 1994 for details)

$$F_{Nk} = \int d^D q_1 \cdots d^D q_N \prod^{N-1} \left( \frac{|q_i - q_j|^{-\gamma}}{\int d^D p_1\, d^D p_2\, |p_1 - p_2|^{-\gamma}} \right). \tag{24}$$



The above product runs over the N − 1 edges of a tree. Since the number of all tree graphs
with N vertices is $N^{N-2}$, the generating function takes the following form:

$$P(x) = \exp\left\{ \sum_{N=1}^{\infty} (x-1)^N \Gamma_N Q_N \right\}, \tag{25}$$

with $Q_1 = Q_2 = 1$. The following shorthand notation is used

$$\Gamma_N = \frac{N^{N-2}\, \bar N^N\, \bar\xi^{N-1}}{N!}, \tag{26}$$

where $\bar\xi$ is the average of the two-point correlation function in a cell

$$\bar\xi = v^{-2} \int_v d^D r_1\, d^D r_2\, \xi(r_1, r_2). \tag{27}$$

Note that equation (25) is also valid if the N-point correlation functions obey the scaling relation
(BS)

$$\xi_N(\lambda r_1, \ldots, \lambda r_N) = \lambda^{-\gamma(N-1)} \xi_N(r_1, \ldots, r_N), \tag{28}$$

which is more general than the hierarchical model (22).
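
The factorial moments implied by equation (25) follow from $F(x) = P(x+1) = \exp\{\sum_N \Gamma_N Q_N x^N\}$, so that $F_k$ is $k!$ times the coefficient of $x^k$. The sketch below performs this series exponentiation. It assumes the shorthand $\Gamma_N = N^{N-2} \bar N^N \bar\xi^{N-1}/N!$, which is what equations (19), (22) and (25) require if $\bar\xi_N = Q_N N^{N-2} \bar\xi^{N-1}$; the function names and numerical values are illustrative only.

```python
from math import factorial


def gamma_n(n, nbar, xi_bar):
    """Assumed shorthand: Gamma_N = N^(N-2) nbar^N xi_bar^(N-1) / N!."""
    return n ** (n - 2) * nbar ** n * xi_bar ** (n - 1) / factorial(n)


def series_exp(c, order):
    """Coefficients a[m] of exp(sum_{n>=1} c[n] x^n) up to x^order.

    Uses the recurrence m a[m] = sum_{j=1..m} j c[j] a[m-j], which follows
    from differentiating A(x) = exp(C(x)).
    """
    a = [0.0] * (order + 1)
    a[0] = 1.0
    for m in range(1, order + 1):
        a[m] = sum(j * c[j] * a[m - j] for j in range(1, m + 1)) / m
    return a


def hierarchical_factorial_moments(nbar, xi_bar, q, order):
    """F_0..F_order from F(x) = exp(sum_N Gamma_N Q_N x^N); q maps N -> Q_N."""
    c = [0.0] + [gamma_n(n, nbar, xi_bar) * q[n] for n in range(1, order + 1)]
    a = series_exp(c, order)
    return [factorial(k) * a[k] for k in range(order + 1)]


if __name__ == "__main__":
    q = {1: 1.0, 2: 1.0, 3: 1.35, 4: 2.33}
    for k, f_k in enumerate(hierarchical_factorial_moments(2.0, 0.5, q, 4)):
        print(k, f_k)
```

With these conventions the first moments reduce to $F_1 = \bar N$, $F_2 = \bar N^2 (1 + \bar\xi)$ and $F_3 = \bar N^3 (1 + 3\bar\xi + 3 Q_3 \bar\xi^2)$, which provides a direct check of the series code.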

To obtain workable expressions of the bivariate and trivariate distributions, supplementary
assumptions are needed. We quote here two approximations, hereafter SS and BeS, worked out
respectively by SSI, SSII and by Bernardeau & Schaeffer (1992).

If the distance r between the two cells is large compared to the cell size $\ell$, the correlation between
two particles, one in each cell, is $\simeq \xi(r)$. The corresponding term in the exponent of equation (20) can then be well approximated
as $Q_{N+M} \Gamma_N \Gamma_M N M\, \xi(r)$ (see Szapudi, Szalay & Boschan 1992; Szapudi et al. 1995) up to linear
order in $\xi/\bar\xi$. This was found to be fairly accurate even when the cells are touching. The bivariate
generating function can thus be written, in this framework,

$$P(x,y) \simeq P(x)P(y) \exp\{R(x,y)\} \simeq P(x)P(y)\,[1 + R(x,y)] + O(\xi^2/\bar\xi^2), \tag{29}$$

$$R(x,y) = \xi \sum_{M=1,N=1}^{\infty} (x-1)^N (y-1)^M Q_{N+M} \Gamma_M \Gamma_N N M.$$
This is the SS approximation. A similar expression was developed in SSII for the trivariate
generating function, which could be used in conjunction with the results of the previous section to
evaluate the cosmic errors. However, the locally Poisson approximation developed in the next
section eliminates the need for using it, therefore we do not quote it here explicitly.

The other approach, BeS, consists in using a special (but still quite general) case of the
hierarchical model. It is assumed that $Q_{Nk}$, the structure constant associated with a tree topology labelled
with k, can be written as

$$Q_{Nk} = \prod_{i=2}^{\infty} \nu_i^{d_i(k)}, \tag{30}$$

where $\nu_i$ is a weight associated with a vertex with i lines, and $d_i(k)$ is the number of such vertices
in tree type k. Under this condition, the bivariate generating function is approximated by

$$P(x,y) \simeq P(x)P(y) \left[ 1 + \tau\{(1-x)\bar N \bar\xi\}\, \tau\{(1-y)\bar N \bar\xi\}\, \xi/\bar\xi^2 \right] + O(\xi^2/\bar\xi^2), \tag{31}$$

where 

t{s) 



\ 



^U-n^^^^^^- (32) 



iV! 

N>2 



Again, this approximation can be generalized to higher order multivariate generating functions (see 
Bernardeau & Schaeffer 1992). 

Although the two approximations SS and BeS appear quite different formally, we shall see in § 6 
that in the regimes we considered in this paper they are practically identical. 

Note that the hierarchical model does not hold in the weakly non-linear regime $\xi_2 \ll 1$ (e.g., Fry
1984; Bernardeau 1992), where a similar but different formalism has to be applied. We conjecture
that the final result will be quite similar (Bernardeau 1994a), although the proof is left for future
work.



4. Calculation of the Cosmic Error 

4.1. Contribution from Disjoint Cells 

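Using the approximations of the previous section, the finite volume effect (second term in eq. [15])
can be computed:

$$\langle P(x)P(y) \rangle_{\rm disjoint} \simeq P(x)P(y) \left[ 1 + \bar\xi(L)\, \Sigma_X(x,y) \right], \tag{33}$$

where

$$\Sigma_{\rm SS}(x,y) = \sum_{M=1,N=1}^{\infty} (x-1)^N (y-1)^M Q_{N+M} \Gamma_M \Gamma_N N M, \tag{34}$$

$$\Sigma_{\rm BeS}(x,y) = \tau\{(1-x)\bar N \bar\xi\}\, \tau\{(1-y)\bar N \bar\xi\} / \bar\xi^2, \tag{35}$$

are derived from the two approximations quoted for the bivariate generating function, and $\bar\xi(L)$ is
given by the following integral

$$\bar\xi(L) = \hat V^{-2} \int_{r_1 \in \hat V,\ r_2 \in \hat V,\ r > 2\ell} d^D r_1\, d^D r_2\, \xi(r_1, r_2). \tag{36}$$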
When $v/V \ll 1$, this quantity is approximately the average of the two-point correlation function
over the survey volume:

$$\bar\xi(L) \simeq V^{-2} \int_V d^D r_1\, d^D r_2\, \xi(r_1, r_2). \tag{37}$$



4.2. Contribution from Overlapping Cells 



The calculation of the first term, $\langle P(x)P(y) \rangle_{\rm overlap}$, of equation (15) is burdened with more
difficulty than the second one, because the trivariate generating function depends on the separation
r between the two overlapping cells in a complicated way. The approximations SS and BeS could be
generalized even though the cells are touching if we disregard the fact that the three cells formed by
the overlap and the remaining parts are nonspherical. We leave this rather cumbersome calculation
for subsequent research.

Instead, we worked out a simple approximation which, as we shall see experimentally in § 5,
provides a sufficiently accurate estimate of the first integral of equation (15) for practical purposes. This
approximation consists of assuming locally Poisson behavior, i.e., that the correlations inside the
union $C_\cup$ of two overlapping cells $C_1$ and $C_2$ are smeared out. A natural consequence of this is
that the effects of the nonsphericity of $C_\cup$ can be neglected as well.

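Let us denote the volume of $C_\cup$ by $v_\cup$ and the radius of the corresponding spherical cell with the
same volume by $\ell_\cup$. Writing $v_\cup = v\,[1 + f_D(\psi)]$ with $\psi = r/\ell$, for three dimensions we have (CBSII)

$$f_3(\psi) = \frac{3}{4}\psi - \frac{1}{16}\psi^3, \tag{38}$$

and the two-dimensional case gives

$$f_2(\psi) = 1 - \frac{1}{\pi} \left[ 2 \arccos\frac{\psi}{2} - \psi \sqrt{1 - \frac{\psi^2}{4}} \right]. \tag{39}$$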
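
The geometric content of $f_D(\psi)$ can be checked numerically. The sketch below assumes that $f_D(\psi) = 1 - v_\cap/v$, where $v_\cap$ is the volume common to two cells of unit radius whose centers are separated by $\psi$; this is the form implied by $v_\cup = 2v - v_\cap$ and by the probabilities p and q used in the locally Poisson model. The closed forms are compared with a Monte Carlo estimate of $v_\cap/v$. Function names are illustrative.

```python
import math
import random


def f3(psi):
    """Assumed 3D form: f_3 = 1 - v_cap / v for two unit spheres at separation psi."""
    return 0.75 * psi - psi ** 3 / 16.0


def f2(psi):
    """Assumed 2D form: f_2 = 1 - (lens area) / (disk area) for two unit disks."""
    lens = 2.0 * math.acos(psi / 2.0) - psi * math.sqrt(1.0 - psi * psi / 4.0)
    return 1.0 - lens / math.pi


def overlap_fraction_mc(psi, dim, n_points=20000, seed=1):
    """Monte Carlo estimate of v_cap / v: fraction of points of one unit ball
    that also fall in a second unit ball whose center is shifted by psi."""
    rng = random.Random(seed)
    inside_first = 0
    inside_both = 0
    while inside_first < n_points:
        point = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if sum(c * c for c in point) > 1.0:
            continue  # rejection sampling: keep only points in the first ball
        inside_first += 1
        shifted = (point[0] - psi) ** 2 + sum(c * c for c in point[1:])
        if shifted <= 1.0:
            inside_both += 1
    return inside_both / inside_first


def overlap_probabilities(psi, dim):
    """p (overlap) and q (rest of one cell) of the locally Poisson model."""
    f = f3(psi) if dim == 3 else f2(psi)
    return (1.0 - f) / (1.0 + f), f / (1.0 + f)


if __name__ == "__main__":
    for psi in (0.5, 1.0, 1.5):
        print(psi, 1.0 - f3(psi), overlap_fraction_mc(psi, 3),
              1.0 - f2(psi), overlap_fraction_mc(psi, 2))
```

Both functions satisfy $f_D(0) = 0$ (coincident cells) and $f_D(2) = 1$ (touching cells), and $p + 2q = 1$ by construction.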



According to the locally Poisson ansatz, the probability of an object to be in a portion of
$C_\cup$ is proportional to the volume of this portion. An object can belong to the overlapping part
$C_\cap = C_1 \cap C_2$ or to the rest $C_i \setminus C_\cap$ of one of the cells with probabilities $p = [1 - f_D(\psi)]/[1 + f_D(\psi)]$
and $q = f_D(\psi)/[1 + f_D(\psi)]$, respectively. The probability $P^S_{H,I,J}$ of finding H, I and J objects in
$C_1 \setminus C_\cap$, $C_\cap$ and $C_2 \setminus C_\cap$, respectively, under the constraint H + I + J = S (Fig. 1), is a "trinomial"
distribution

$$P^S_{H,I,J} = \frac{S!}{H!\,I!\,J!}\, q^H p^I q^J, \tag{40}$$

with the following generating function

$$P^S(x,y,z) = (qx + py + qz)^S. \tag{41}$$



Since the effects of the nonsphericity of the volume formed by $C_1 \cup C_2$ can be neglected under the
locally Poisson assumption, the unconstrained probability $P_{H,I,J}$ of finding H, I, J objects respectively in
$C_1 \setminus C_\cap$, $C_\cap$ and $C_2 \setminus C_\cap$ is simply

$$P_{H,I,J} = P^{\cup}_{H+I+J}\, P^{H+I+J}_{H,I,J}, \tag{42}$$
Fig. 1. This is a symbolic drawing of the two overlapping cells $C_1$ and $C_2$. The intersection
$C_\cap = C_1 \cap C_2$ contains I objects. Each remaining part, $C_1 \setminus C_\cap$ and $C_2 \setminus C_\cap$, respectively, contains
H and J objects. We compute the probability of having N = H + I objects in $C_1$ and M = I + J
objects in $C_2$, with S = H + I + J objects in total. The locally Poisson behavior allows us to neglect both the correlations inside the
union $C_\cup = C_1 \cup C_2$ of the two cells and the nonsphericity of $C_\cup$.



where the probability of finding S objects in a spherical cell of radius $\ell_\cup$ is

$$P^{\cup}_S = P_S \big|_{\ell = \ell_\cup}. \tag{43}$$

The trivariate generating function thus can be written as

$$P(x,y,z) = \sum_S P^{\cup}_S (qx + py + qz)^S = P^{\cup}(qx + py + qz). \tag{44}$$

Then, according to equation (18), the generating function of the bivariate probability $P_{N,M}$ of
finding N = H + I and M = I + J objects in the overlapping cells $C_1$ and $C_2$, respectively, is

$$P^{\cup}(x,y) = P^{\cup}(q(x+y) + pxy), \tag{45}$$

which takes the following form from equation (25):

$$P^{\cup}(x,y) = \exp\left\{ \sum_{N \ge 1} \Gamma_N Q_N\, [1 + f_D(\psi)]^{-\gamma(N-1)/D} \left[ f_D(\psi)(x+y) + [1 - f_D(\psi)]\,xy - 1 - f_D(\psi) \right]^N \right\}, \tag{46}$$

where $\gamma = \gamma_3 - 3 + D$ and $\gamma_3$ is the slope of the three-dimensional correlation function. In this equation, the effect of
higher order clustering is fully contained in the term $\Gamma_N Q_N$, evaluated at the original cell size $\ell$ of
$C_1$ and $C_2$.

Changing variables to $\psi = r/\ell$, the first term of equation (15) becomes

$$\langle P(x)P(y) \rangle_{\rm overlap} \simeq \frac{v}{V} \int_0^2 D \psi^{D-1}\, d\psi\, P^{\cup}(x,y) \tag{47}$$

up to leading order in $v/V$ (see Appendix B for details). This contribution is inversely proportional
to $C_V = V/v$, the number of cells needed to cover the survey volume.
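
A useful sanity check of the overlap term is the unclustered limit. Assuming a Poisson sample ($\bar\xi = 0$, so that only the $N = 1$ term survives in the locally Poisson generating function), the coefficient of $xy$ in the overlap contribution, after subtracting $2^D P(x)P(y)$ and shifting $x \to x+1$, $y \to y+1$, reduces to $(v/V)\,\bar N \int_0^2 D\psi^{D-1}[1 - f_D(\psi)]\,d\psi$. The sketch below evaluates this integral by the midpoint rule, using the forms of $f_D$ implied by the overlap geometry (an assumption stated in the code). The integral equals 1, so the relative variance of $F_1$ is $v/(V \bar N) = 1/N_{\rm par}$, the familiar shot noise.

```python
import math


def f_overlap(psi, dim):
    """Assumed f_D(psi) = 1 - v_cap / v for two cells of unit radius."""
    if dim == 3:
        return 0.75 * psi - psi ** 3 / 16.0
    if dim == 2:
        lens = 2.0 * math.acos(psi / 2.0) - psi * math.sqrt(1.0 - psi * psi / 4.0)
        return 1.0 - lens / math.pi
    raise ValueError("only dim = 2 or 3 are sketched here")


def overlap_weight(dim, n_steps=4000):
    """Midpoint-rule value of int_0^2 D psi^(D-1) [1 - f_D(psi)] dpsi."""
    h = 2.0 / n_steps
    total = 0.0
    for i in range(n_steps):
        psi = (i + 0.5) * h
        total += dim * psi ** (dim - 1) * (1.0 - f_overlap(psi, dim))
    return total * h


def poisson_relative_variance_f1(nbar, v_over_V, dim):
    """(Delta F_1 / F_1)^2 for an unclustered sample, to leading order in v/V."""
    return v_over_V * overlap_weight(dim) / nbar


if __name__ == "__main__":
    print(overlap_weight(3), overlap_weight(2))
    nbar, v_over_V = 5.0, 1.0e-3
    n_par = nbar / v_over_V  # number of objects in the catalog
    print(poisson_relative_variance_f1(nbar, v_over_V, 3), 1.0 / n_par)
```

For a clustered sample the same quadrature would be applied to the full $P^\cup(x,y)$, expanded to the required order in x and y.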



4.3. Theoretical Results: Cosmic Error on Factorial Moments 

After carrying out the appropriate subtractions, the previous results can be summarized as

$$E^{\infty,V}(x,y) \simeq \bar\xi(L)\, P(x)P(y)\, \Sigma_X(x,y) + \frac{v}{V} \left\{ \int_0^2 D \psi^{D-1}\, d\psi\, P^{\cup}(x,y) - 2^D P(x)P(y) \right\}, \tag{48}$$

to leading order in $v/V$, where $\Sigma_X(x,y)$ is given by equations (34) or (35), $P^{\cup}(x,y)$ by equation (46)
and $P(x)$ by equation (25). The cosmic error $\Delta^{\infty,V} F_k$ on the factorial moment of order k can be
computed by expanding the generating function $E^{\infty,V}(x+1,y+1)$ and then integrating numerically
the function $P^{\cup}$. The final result can be expressed as a function of the $F_j$'s with $j \le 2k$.

We computed $\Delta^{\infty,V} F_k$ for $k \le 4$. In the three-dimensional case, D = 3, we considered various
values of $\gamma$: $\gamma$ = 1.8, 1.5, 1.2, and 0.9. In the two-dimensional case, D = 2, we carried out the
calculation only for $\gamma = \gamma_3 - 1 = 0.8$.

Here we present the results explicitly for $\gamma_3 = 1.8$, D = 2, 3 and $k \le 3$. Because of their physical
significance, we disentangled the terms corresponding to the finite volume, edge and discreteness
effects:

$$(\Delta^{\infty,V} F_k)^2 = (\Delta^{X}_{\rm finite} F_k)^2 + (\Delta_{\rm edge} F_k)^2 + (\Delta_{\rm discrete} F_k)^2. \tag{49}$$
10.59 iVp + 74.23 iV^^^ + 1.709 + 42.37 iV +
148.5 iV^pQa + 111.2 iV^p + 44.40 iVp ^4 + 29&AN'^i^Qi + 
349.3 p gs) iV^^. (65) 



Let us discuss the significance and meaning of the different terms in these equations. $(\Delta^{X}_{\rm finite} F_k)^2$,
with X=SS or BeS, is proportional to $\bar\xi(L)$ (and formally independent of the spatial dimension D).
It corresponds to the finite volume effect due to the fluctuations of the underlying random field at
scales larger than the sample size. The relative finite volume error $\Delta^{X}_{\rm finite} F_k / F_k$ is independent of
$\bar N$, or, in other words, independent of the number of objects $N_{\rm par}$ in the catalog, in accordance
with intuition. The two approximations SS and BeS, although quite different formally, gave similar
results in all practical cases studied. To illustrate this point, we computed the expected finite volume
errors in these two approximations for a three-dimensional hierarchical sample S with power law
correlation function $\bar\xi = (\ell/\ell_0)^{-\gamma_S}$, $\gamma_S = 1.8$, and $\ell_0 = L/20$. For the amplitude of higher order
correlations we take the $Q_N$'s measured by Gaztañaga (1994) in the APM survey (corrected to
have three-dimensional statistics, see also Szapudi et al. 1995)

$$Q_3 = 1.35,\quad Q_4 = 2.33,\quad Q_5 = 4.02,\quad Q_6 = 6.7,\quad Q_7 = 10,\quad Q_8 = 12. \tag{66}$$

Figure 2 displays the quantity $\Delta^{X}_{\rm finite} F_k / F_k$ as a function of $\ell/L$. Each panel corresponds to a
given order k. The difference between the two approximations is mostly negligible, at most 20%.
$\Delta^{X}_{\rm finite} F_k / F_k$ increases with k, as expected, and it exhibits two plateaux: one in the weakly nonlinear
regime $\xi_2 \ll 1$, and one in the highly nonlinear regime $\xi_2 \gg 1$, as can be easily inferred from
equations (50), (54), (55), (60) and (61).

The term $(\Delta_{\rm edge} F_k)^2$ corresponds to the edge effects. It is due to the fact that the statistical
weight given to objects near the edge is smaller than that given to objects far away from it. It can be clearly disentangled
from the discreteness effect term $(\Delta_{\rm discrete} F_k)^2$, corresponding to the "shot noise" introduced by
the finite number of objects in the catalog. Indeed, in the continuous limit ($\bar N \to \infty$) the relative
error $\Delta_{\rm discrete} F_k / F_k$ vanishes, whereas the relative error $\Delta_{\rm edge} F_k / F_k$ is independent of $\bar N$, and
proportional to $\sqrt{\bar\xi\, v/V}$ up to leading order in $v/V$. This also suggests that edge effects are
nonexistent (or very weak) for a Poisson (or a weakly clustered) sample, confirming the intuition that
geometry is unimportant for (nearly) Poisson statistics. To illustrate these points, Figure 3 displays
the quantities $\Delta_{\rm edge} F_k / F_k$ and $\Delta_{\rm discrete} F_k / F_k$ for our reference catalog S, assuming various values of
$N_{\rm par}$ = 500, 5000, 50000. As expected, discreteness effects are larger at smaller scales, particularly
when the number of objects $N_{\rm par}$ is small, while edge effects increase with $\ell$.
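
To give a feeling for the numbers entering the reference catalog S, the following sketch tabulates $\bar\xi$ and $\bar N$ as functions of $\ell/L$. It assumes the power law with slope 1.8 and $\ell_0 = L/20$ quoted above and, as simplifying assumptions not stated in the text, spherical cells of radius $\ell$ in a cubic survey of volume $V = L^3$, so that $\bar N = N_{\rm par}\, v/V$. Function names are illustrative.

```python
import math


def xi_bar(l_over_L, gamma=1.8, l0_over_L=1.0 / 20.0):
    """Power-law average correlation, xi_bar = (l / l0)^(-gamma)."""
    return (l_over_L / l0_over_L) ** (-gamma)


def mean_count(n_par, l_over_L):
    """nbar = N_par v / V, ASSUMING spherical cells in a cubic survey V = L^3."""
    return n_par * (4.0 * math.pi / 3.0) * l_over_L ** 3


if __name__ == "__main__":
    for l_over_L in (0.01, 0.02, 0.05, 0.1):
        counts = [mean_count(n_par, l_over_L) for n_par in (500, 5000, 50000)]
        print(l_over_L, xi_bar(l_over_L), counts)
```

At small $\ell/L$ the mean count per cell drops well below unity for the sparser catalogs, which is the regime where the discreteness term dominates.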
The formal expression for the edge and discreteness effects is

$$(\Delta_X F_k)^2 = \sum c^{X}_{h,i,j_1,\ldots,j_{2N}}\, \bar N^h\, \bar\xi^i\, Q_1^{j_1} \cdots Q_{2N}^{j_{2N}}, \tag{67}$$

with X=edge or X=discrete. The numerical value of the coefficients $c_{h,i,j_1,\ldots,j_{2N}}$ is fairly
insensitive to changes of $\gamma_3$ in the regime consistent with the observations (moreover the two and
three-dimensional coefficients are quite similar), therefore the quoted equations constitute good
approximations even for $\gamma_3 \ne 1.8$. To show this, we computed $c_{h,i,j_1,\ldots,j_{2N}}$ for D = 3, $k \le 4$,
and for various values of $\gamma$ = 1.8, 1.5, 1.2 and 0.9. The corresponding errors



14 



10 



lb- 

0.1^ 

< 



0.01 =- 



0.001 



1 — I — I I I 1 1 11 1 — I — I I I 1 1 



I I I I I I L 



0.01 



0.1 



10 



1 =- 



m 

^ 0.1k 

< 



0.01 



0.001 



l/L 

T — I — I I I 1 1 1| 1 — I — I I I 1 1 1| 1 — I — r 



I I 



0.01 



l/L 



0.1 




0.001 



10 



1 r 



< 



0.1 



0.01 



0.001 



: 1 


1 1 1 1 M| 1 


1 1 1 1 1 1 l| 


1 II: 












1 1 


1 1 1 


1 1 1 



0.01 



l/L 



0.1 



Fig. 2. — The relative finite volume error A^^^F^/Fk brought by fluctuations of the underlying 
random field at scales larger than the sample size L ~ V^^'^ is plotted as a function of cell size I, for 
our reference catalog S (see text). Each panel corresponds to a given value of k. The dotted-dashed 
and dashed curves correspond respectively to the approximations SS and BeS discussed in § 3. For 
k = 1, both approximations give the same results. For k > 2, they differ only by a small amount. 
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Fig. 3.— The error A^'^s^Fk/Fk brought by edge effects (long dashes) and the error A'^^=^''^'^Ffc/Ffc 
due to the finite number of objects in the catalog (dots-long dashes) are displayed as functions 
of cell size I, for our reference set S (see text). The number of objects in S is assumed to be 
JVpar = 500, 5000 and 50000. When iVpar increases, A'^'^'^'^^^^Fk/ Fk decreases whereas A^'^s^Fk/Fk 
remains constant. 
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(^A^dg^Fk/Fk^ + (^A'^^'^^'^^^Fk/Fk^ for our reference set S, assuming that iVpar = 5000, are dis- 
played on Figure 4 as functions of l/L. Each panel corresponds to a given value of the order k. 
Note that on Figure 4, the slope of the correlation function, 75, used to evaluate ^, is held fixed. 



5. Example: a Rayleigh-Levy Fractal 



In the previous section, two important hypotheses made the calculation of the errors possible: the 
hierarchical model and locally Poisson behavior. The first assumption is supported by the statistical 
properties of the galaxy distribution and the measurements in $N$-body simulations. Here, to justify 
the second ansatz, we estimate experimentally the errors on measurements of factorial moments in 
subsamples of a homogeneous Rayleigh-Levy fractal described in § 5.1, and compare the results to 
our theoretical predictions in § 5.2. Indeed, such a control sample is strongly clustered, thus far 
from showing a locally Poisson behavior. Another significant point, discussed in the introduction, 
is to examine the distribution of errors rather than just the dispersion of the measurements. Since 
we have access only to a unique part of the Universe, it is important to know to what extent the 
cosmic errors are systematic. To study this question analytically with the same degree of generality 
we used until now would be rather difficult. Instead, we measure in § 5.3 the distribution of errors 
in our control sample. 



5.1. The sample 

The sample $\mathcal{F}$, of $N_{\rm par} = 128^3$ points, was generated in a three-dimensional unit torus (a 
cube of size $L_{\mathcal F} = 1$ with periodic boundary conditions) using 1024 Rayleigh-Levy random walks 
$W_i = \{W_{i,j},\ j = 1, \ldots, 2048\}$. Each walk starts from a point $W_{i,1}$ at a random position. In a given 
walk $W_i$, the next point $W_{i,j+1}$ is chosen in a random direction and at a distance $r$ from $W_{i,j}$ drawn 
from the following distribution 

$P(r > \ell) = (\ell_p/\ell)^{\epsilon}$, $\ell \ge \ell_p$, 
$P(r > \ell) = 1$, $\ell < \ell_p$. (68) 
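The construction above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the paper: it assumes the step-length law reads $P(r > \ell) = (\ell_p/\ell)^{\epsilon}$ for $\ell \ge \ell_p$, samples it by inverse transform, and uses a placeholder value for $\ell_p$. The function names `levy_step_lengths` and `rayleigh_levy_walk` are hypothetical.

```python
import numpy as np

def levy_step_lengths(n, l_p, eps, rng):
    """Draw n step lengths with P(r > l) = (l_p / l)**eps for l >= l_p."""
    # Inverse transform: if u is uniform on (0, 1], r = l_p * u**(-1/eps).
    u = 1.0 - rng.random(n)              # rng.random is on [0, 1), so u is on (0, 1]
    return l_p * u ** (-1.0 / eps)

def rayleigh_levy_walk(n_points, l_p, eps, rng):
    """One walk of n_points positions in the unit torus [0, 1]^3."""
    r = levy_step_lengths(n_points - 1, l_p, eps, rng)
    # Isotropic directions: normalised Gaussian vectors.
    d = rng.normal(size=(n_points - 1, 3))
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    start = rng.random(3)                # random starting point
    pos = np.vstack([start, start + np.cumsum(r[:, None] * d, axis=0)])
    return pos % 1.0                     # periodic boundary conditions

rng = np.random.default_rng(0)
# l_p = 1e-3 is a placeholder chosen for illustration, not the paper's value.
walk = rayleigh_levy_walk(n_points=2048, l_p=1e-3, eps=1.2, rng=rng)
```

Repeating the call 1024 times and stacking the results would give a point set of the same size as the sample described in the text, $1024 \times 2048 = 128^3$ points.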

The statistical properties of a Rayleigh-Levy fractal can be fully calculated once $N_{\rm par}$, $\epsilon$, and the 
percolation length $\ell_p$ are known, in particular, the two-point correlation function $\xi(r)$ (Mandelbrot 
1975; Peebles 1980), hence $\bar{\xi}$ (see details in CBSII). Here we chose $\ell_p = 4.38 \times 10^{\ldots}$ and 
$\epsilon = 1.2$ so that $\bar{\xi} = (\ell/\ell_0)^{-\gamma_{\mathcal F}}$ with $\ell_0 = \ldots$ and 
$\gamma_{\mathcal F} = 1.8 = 3 - \epsilon$. It can be easily shown that such a fractal 
obeys the special hierarchical model of equation (22) (see, e.g., Hamilton & Gott 1988; Bernardeau 
& Schaeffer 1992; CBSII; and eq. [30]). The approximation BeS for the finite volume effect error is 
thus exact in this particular case (up to leading order in $v/V$). To compute the errors on $F_k$ up to 
$k = 4$, we need the values of $Q_N$ up to $N = 8$. We have approximately $Q_N \simeq 2^{1-N} N!/N^{N-2}$, with 
some small correction from the effect of the smoothing over the cell. Following CBSII (see their 
Table 1), we have, with this correction, 

$Q_3 \simeq 0.514,\ Q_4 \simeq 0.200,\ Q_5 \simeq 0.0662,\ Q_6 \simeq 0.0199,\ Q_7 \simeq 0.0056,\ Q_8 \simeq 0.0015.$ (69) 
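As a quick consistency check, the leading-order amplitudes can be compared with the smoothed values quoted in equation (69). The sketch below assumes that the leading-order expression reads $Q_N \simeq 2^{1-N} N!/N^{N-2}$ (a reading of the formula above that reproduces the quoted numbers to within the expected small correction); the function name `q_leading` is hypothetical.

```python
from math import factorial

def q_leading(n):
    """Assumed leading-order amplitude Q_N = 2**(1-N) * N! / N**(N-2)."""
    return 2.0 ** (1 - n) * factorial(n) / n ** (n - 2)

# Values quoted in equation (69), which include the smoothing correction.
q_quoted = {3: 0.514, 4: 0.200, 5: 0.0662, 6: 0.0199, 7: 0.0056, 8: 0.0015}

for n, q in q_quoted.items():
    print(f"N={n}: leading order {q_leading(n):.4g}, quoted {q:.4g}, "
          f"ratio {q_leading(n) / q:.2f}")
```

Under this assumption the uncorrected values lie somewhat below the quoted ones at every order, with the gap widening as $N$ grows.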

Figure 5 displays a thin slice of $\mathcal{F}$ ($L_{\mathcal F}/50$ thick), showing the clumpy nature of our 
fractal, which is thus far from being locally Poisson. 

Fig. 4. This figure displays the dependence of the "overlapping" error, 
$(\Delta^{\rm overlap}F_k/F_k)^2 = (\Delta^{\rm edge}F_k/F_k)^2 + (\Delta^{\rm discrete}F_k/F_k)^2$, on the value 
of $\gamma$ used to compute the coefficients $c_{h,i,j_1,\ldots,j_{2N}}$ in equation (67). The quantity 
$\Delta^{\rm overlap}F_k/F_k$ is plotted as a function of cell size, for $\gamma$ = 1.8, 1.5, 1.2 and 0.9 and for our 
reference sample S (without changing $\gamma_S$), assuming that it contains $N_{\rm par} = 5000$ objects. 
The value of $\Delta^{\rm overlap}F_k/F_k$ increases with $1/\gamma$. 

Fig. 5. A thin slice ($L_{\mathcal F}/50$ thick) of our Rayleigh-Levy fractal is shown (see text). 

5.2. Comparison of the Theoretical Predictions with Numerical Results 

From $\mathcal{F}$, we extracted $N_{\rm sub} = 1000$ cubical subsamples $\mathcal{F}_i$ of size $L = L_{\mathcal F}/4$. 
The position of each subsample was chosen randomly. We also randomly diluted the subsamples by a factor 
$1/\alpha = 64$. In other words, the probability that a particle in the volume intersecting $\mathcal{F}_i$ was 
selected is $p = \alpha$. In the following, if $A$ is a statistical quantity, $A^i$ is its measurement in 
$\mathcal{F}_i$, and $\langle A \rangle = N_{\rm sub}^{-1} \sum_i A^i$ is the average of $A^i$ over the $N_{\rm sub}$ 
realizations. If $N^i_{\rm par}$ is the number of particles per subsample, its average is thus 
$\langle N^i_{\rm par} \rangle = \alpha\,(L/L_{\mathcal F})^3 N_{\rm par} = 512$. In each subsample we measured 
the factorial moments $F_k^i$, using a large number $C \ge 128^3$ of cells so that the measurement error 
$\Delta^{C,\infty}F_k$ was negligible. The (biased) experimental estimate of the error on $F_k$ is thus 

$(\Delta F_k)^2 = \langle (\delta F_k)^2 \rangle$, (70) 

with 

$\delta F_k^i = F_k^i - \langle F_k \rangle$. (71) 
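The estimator of equations (70) and (71) can be written compactly. The following is a minimal sketch, assuming that $F_k$ is the average of $N(N-1)\cdots(N-k+1)$ over the cell counts $N$ and that the error is the root mean square of $F_k^i - \langle F_k \rangle$ over subsamples; the function names and the tiny hand-made count lists are hypothetical and serve only as an illustration.

```python
import numpy as np

def factorial_moment(counts, k):
    """F_k = <N (N-1) ... (N-k+1)>, averaged over the counts in cells."""
    counts = np.asarray(counts, dtype=float)
    prod = np.ones_like(counts)
    for j in range(k):
        prod *= counts - j
    return prod.mean()

def error_estimate(f_sub):
    """Biased estimate of Delta F_k: sqrt(<(F_k^i - <F_k>)^2>)."""
    f_sub = np.asarray(f_sub, dtype=float)
    delta = f_sub - f_sub.mean()
    return np.sqrt(np.mean(delta ** 2))

# Particle budget quoted in the text: full sample, dilution, volume fraction.
n_par = 1024 * 2048                      # 128**3 points
mean_n_sub = (1 / 64) * (1 / 4) ** 3 * n_par

# Toy cell counts for three subsamples (illustration only, not real data).
counts_per_subsample = [[0, 1, 2, 3], [1, 1, 2, 4], [0, 0, 3, 5]]
f2 = [factorial_moment(c, 2) for c in counts_per_subsample]
delta_f2 = error_estimate(f2)
```

The estimate is called biased because the mean $\langle F_k \rangle$ is itself computed from the same subsamples; with 1000 subsamples the difference from the unbiased form is small.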

Similarly to the previous considerations, there is a measurement error on the error due to the finite 
number of subsamples extracted from $\mathcal{F}$, and there is a cosmic error on the error due to the finite 
size of $\mathcal{F}$, its geometry and the finite number of objects it contains. The former can be estimated 
by straightforward error propagation; the latter is rather complicated because it depends on, e.g., 
16th-order quantities for the 4th-order cosmic error on the error. 

Figure 6 displays the measured value of $\Delta F_k/F_k$ (circles) versus the theoretical predictions (solid 
curves), which use the approximation BeS to compute the finite volume error $\Delta^{\rm finite}F_k/F_k$ (short 
dashes). Note that the SS approximation would give similar results. The edge effect contribution 
$\Delta^{\rm edge}F_k/F_k$ (long dashes) and the discreteness contribution $\Delta^{\rm discrete}F_k/F_k$ (dot-long 
dashes) are also displayed. There are errorbars on the dots, when sufficiently large to be visible. They 
correspond to the measurement error on the error. Note the excellent agreement of our predictions with 
the numerical experiment, which also shows that the cosmic error on the error problem mentioned 
above is well controlled in our experiment. This particular example also illustrates how the shot 
noise contribution $\Delta^{\rm discrete}F_k/F_k$ to the error increases with $k$ at small scales. The edge effect 
contribution is negligible at small $\ell$ but always dominates at large scales, although the steadier 
finite volume effect contribution is of the same order in this regime. 

Fig. 6. This figure shows the measured value of the error $\Delta F_k/F_k$ on the measurement of $F_k$ in 
the fractal subsamples $\mathcal{F}_i$ as a function of scale (dots), and our theoretical prediction (continuous 
curve). Each panel corresponds to a value of $k$. The BeS approximation was used to compute 
the finite volume error $\Delta^{\rm finite}F_k/F_k$ (short dashes). The edge effect contribution 
$\Delta^{\rm edge}F_k/F_k$ (long dashes) and the discreteness contribution $\Delta^{\rm discrete}F_k/F_k$ (dot-long 
dashes) are also displayed. 

5.3. Distribution of the Errors 

Let us imagine now that we have access to only one subsample $\mathcal{F}_i$ and measure $F_k^i$. Is it more 
likely to under- or overestimate the real value of $F_k$? To answer this question, we measured the 
probability distribution function $\Upsilon(\delta F_k^i)$ of the errors. The result is displayed in Figure 7 for 
two scales, $\ell = L/64$ (four top panels, each one corresponding to a given value of $k$) and $\ell = L/4$ (four 
bottom panels). On each panel, the errorbars on the measurements reflect the finite number of subsamples, 
and the continuous curve corresponds to a Gaussian with average zero and variance $(\Delta F_k)^2$. From 
Figure 7, the distribution of the errors is increasingly skewed with both increasing $k$ and scale 
compared to a Gaussian. In particular, the maximum of the function $\Upsilon(\delta F_k^i)$ lies at a negative 
value of $\delta F_k^i$, which indicates that one is likely to underestimate the real value of $F_k$, especially if 
the order $k$ and/or the cell size $\ell$ is large. This systematic effect was already pointed out by CBSI and 
CBSII, who proposed a method to correct for it in some particular regimes using some assumptions on the 
asymptotic properties of the probability distribution. 
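The tendency described above can be quantified with two simple summary numbers: the fraction of subsamples whose measurement falls below the mean, and the skewness of the normalized errors. The sketch below is an illustration under the assumption that the errors are taken relative to the subsample mean, as in equation (71); the function name and the toy data are hypothetical.

```python
import numpy as np

def error_distribution_summary(f_sub):
    """Fraction of measurements below the mean and skewness of the errors."""
    f_sub = np.asarray(f_sub, dtype=float)
    delta = f_sub - f_sub.mean()                 # errors relative to the mean
    sigma = np.sqrt(np.mean(delta ** 2))         # dispersion of the errors
    x = delta / sigma                            # normalized errors
    return {
        "fraction_below_mean": float(np.mean(delta < 0)),
        "skewness": float(np.mean(x ** 3)),
    }

# Toy data with a long positive tail (illustration only, not real data).
summary = error_distribution_summary([1.0, 1.0, 1.0, 1.0, 6.0])
```

For a positively skewed distribution such as this toy example, most individual measurements lie below the mean, which is the sense in which a single subsample is likely to underestimate the true value.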

6. Discussion 

In this paper we calculated the theoretical error on statistics related to counts in cells in a finite 
galaxy catalog. We identified the different contributions to the total theoretical error (the system- 
atics of the observations are disregarded): the measurement error, due to the finite number of cells 
thrown to estimate the count probabilities (this in principle can be eliminated with the algorithm of 
Szapudi 1995), and the cosmic error, inherent to any finite catalog. The cosmic error itself has three 
contributions: the discreteness effect arising from sampling the underlying continuous random field 
with a finite number of points, the edge effect caused by the lesser statistical weight given to objects 
near the boundary of the survey, and the finite volume effect, from fluctuations of the underlying 
random field at scales larger than the sample size. First, we presented the general mathematical 
formulation of the problem establishing a firm groundwork for subsequent applications, and solving 
the practical measurement error problem. For the cosmic error, in the framework of the hierarchical 
tree assumption, we have found a good approximation using a locally Poisson ansatz. The results, 
in excellent agreement with our control measurements performed on a Rayleigh-Levy hierarchical 
sample, give a simple and useful way of estimating the expected errors for a galaxy survey with 
prescribed properties. Measurements of the distribution of the errors on the same Rayleigh-Levy 
fractal showed that the cosmic errors tend to be systematic, i.e., it is more likely to underestimate 
the true value of $F_k$ than to overestimate it, as already pointed out by CBSI. 

Fig. 7. The probability distribution function $\Upsilon(\delta F_k^i)$ of the errors as a function of 
$\delta F_k^i/\Delta F_k$, measured in our 1000 subsamples $\mathcal{F}_i$. The four upper panels, each one for a 
given $k$, correspond to cell size $\ell = L/64$. The four lower panels correspond to $\ell = L/4$. On each panel, 
a Gaussian with average zero and with variance $(\Delta F_k)^2$ is displayed. 

To further illustrate our results, we discuss here two important subjects. The first one concerns 
the dependence of the errors on the clustering properties of the underlying distribution. In particular, 
is it fair to assume Gaussian behavior to compute the errors on statistics related to counts 
in cells? The second subject is the previously mentioned (§ 2) concept of "number of statistically 
independent cells". Further applications, such as reanalyses of the cosmic error on counts in 
cells measured in galaxy catalogs, or sampling strategies for galaxy surveys (Kaiser 1986), will be 
discussed elsewhere (Colombi, Szapudi & Szalay 1995). Note also, as mentioned in § 3, that the 
hierarchical model is not expected to be valid in the weakly nonlinear regime, although it should 
be a good approximation for the calculation of the errors (Bernardeau 1994a). More detailed 
investigations of the weakly nonlinear regime are left for future work. 

In what follows, the approximation SS discussed in § 3 was used to compute the finite volume 
contribution to the error. 

Figure 8 illustrates the difference between Gaussian and full hierarchical assumptions for error 
calculations by displaying the expected cosmic error on the measurement of $F_k$ for samples of 
volume $V = L^3$ and correlation function $\bar{\xi} = (\ell/\ell_0)^{-\gamma}$ with $\gamma = 1.8$ and $\ell_0 = L/20$. 
All these fictitious samples have the same number of objects, $N_{\rm par} = 5000$; however, the higher order 
statistics are varied. Each panel corresponds to a given value of $k$. The long dashes assume Gaussian 
underlying statistics for the calculation of $F_k$ and the calculation of the error. For $k \ge 2$, the (upper) 
short dashes, dots, and continuous curve assume that the hierarchy of $Q_N$'s is given by perturbation 
theory predictions (see, e.g., Bernardeau 1994b) for an initial power spectrum $\propto k^n$ with 
$n = -2, -1, 0$, respectively. For $k \ge 3$, the lower short dashes, dots and continuous curve assume 
the same hierarchy of $Q_N$'s for the calculation of $F_k$, but the errors are computed from Gaussian 
statistics. In the case $k = 2$, such an assumption would lead to the long dashes, whatever the values 
of the $Q_N$'s. In the case $k = 1$, the error depends only on statistics up to second order, for which all 
the models under consideration are equivalent. These plots clearly show a strong dependence of the 
cosmic error on the underlying clustering properties of the system. Moreover, assuming Gaussian 
statistics to compute the errors seems unreasonable: the errors can be severely underestimated, 
except, as expected, in the weakly nonlinear regime. 

Fig. 8. This figure displays the relative cosmic error on the measured factorial moment $F_k$ 
($k = 1, \ldots, 4$), as a function of the cell size $\ell$ for a catalog of volume $V = L^3$, assuming that it 
contains $N_{\rm par} = 5000$ objects and its correlation function is $\bar{\xi} = (\ell/\ell_0)^{-1.8}$ with 
$\ell_0 = L/20$. Each panel corresponds to a different value of $k$. The long dashes assume underlying 
Gaussian statistics. For $k \ge 2$, the upper dots, short dashes and continuous curves assume that higher 
order statistics is given by perturbation theory predictions for scale invariant initial power-spectra of 
indexes $n = -2$, $-1$, respectively. For $k \ge 3$, there are lower dots, short dashes and continuous curve. In 
that case, the error $\Delta F_k$ has been calculated assuming Gaussian statistics, but the $F_k$ remain unchanged. 
The vertical dotted line on each panel marks the value of the correlation length $\ell_0$. 

Fig. 9. The "number of statistically independent cells" $C^*$ for the factorial moments of order $k$. 
It is divided by the number of cells needed to cover the catalog, $C_V = V/v$, and plotted as a function 
of cell size for our reference catalog S. The left panel assumes that it contains $N_{\rm par} = 5000$ objects, 
while the right panel corresponds to the continuous limit. The solid curve, dots, short dashes and 
long dashes correspond respectively to $k$ = 1, 2, 3 and 4. 

Let us turn to a widely used but seldom explained concept: the number of statistically independent 
cells. We defined it in § 2 as the number of cells $C^*$ needed to sample the catalog so that the 
measurement error equals the cosmic error. This definition ensures that most of the statistically 
relevant information is extracted. In that sense, these cells can be considered as "statistically 
independent". However, there must remain residual information in the survey obtainable via more 
sampling cells, since the overall error can be decreased by another factor of two. This illustrates 
the level of arbitrariness in the concept of "number of statistically independent cells" for a sample 
of finite volume. Another popular but certainly erroneous choice is $C^* = C_V = V/v$, the number 
of cells needed to cover the sample volume $V$. To compare this choice with our more natural 
definition, Figure 9 displays the quantity $C^*/C_V$ for our reference sample S (see § 4.3), assuming that 
it contains $N_{\rm par} = 5000$ objects (left panel) and in the continuous limit $N_{\rm par} = \infty$ (right panel). 
It has been computed from the errors on $F_k$, with $k$ = 1 (solid curves), 2 (dots), 3 (short dashes) 
and 4 (long dashes). The first thing to notice is that the number of "statistically independent cells" is 
not universal: $C^*$ depends on the statistical object under study. In our example it increases with 
$k$. It is generally different from $C_V$ by several orders of magnitude. Note that $C^*$ is smaller in the 
continuous limit than for finite $N_{\rm par}$, contrary to the expectation motivated by the fact that the 
shot noise tends to increase the cosmic errors. Another shot noise contribution to the measurement 
error, however, increases when $N_{\rm par}$ is small, thus explaining this counterintuitive effect. As already 
discussed in § 2, throwing a number of sampling cells $C \gg C^*$ would decrease the overall errors by 
a factor two. Moreover $C^*$ highly depends on the statistical object under study, so we endorse the 
use of as many cells as possible for counts in cells measurements, or the use of an algorithm which is 
equivalent to throwing an infinite number of cells (Szapudi 1995). 
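The role of $C^*$ discussed above can be illustrated with a toy scaling. The sketch below assumes, as a simplification not taken from the text, that the measurement variance falls as $1/C$ and adds to a fixed cosmic variance, so that the two are equal at $C = C^*$; the function name `total_error_ratio` is hypothetical.

```python
import math

def total_error_ratio(c_over_cstar):
    """Total error over cosmic error when C = c_over_cstar * C*.

    Assumes total variance = cosmic variance * (1 + C*/C).
    """
    return math.sqrt(1.0 + 1.0 / c_over_cstar)

for ratio in (0.1, 1.0, 10.0, 1000.0):
    r = total_error_ratio(ratio)
    print(f"C/C* = {ratio:g}: total/cosmic error = {r:.3f}, variance ratio = {r * r:.3f}")
```

Under this assumption the total variance at $C = C^*$ is twice the cosmic variance, so the squared error can still drop by a factor of two when many more cells are used, which is one way to read the residual gain mentioned in the text.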
Appendix 

A. Generating Function for Overlapping Cells 

In this Appendix, we show how to relate the bivariate generating function of overlapping cells to 
a trivariate generating function. Two overlapping cells can be imagined as three non-overlapping 
(touching) cells (see Fig. 1). Knowing the trivariate probability distribution $P_{H,I,J}$ of these three 
cells, it is simple to express the bivariate probability distribution for the two original cells 

$P_{M,N} = \sum_{H,I,J} \delta(H + I = N)\,\delta(I + J = M)\,P_{H,I,J}$, (A1) 

where $I$ is the number count in the overlap area. The bivariate generating function can be expressed 
in terms of the trivariate generating function 

$P(x,y) = \sum_{M,N} P_{M,N}\,x^{N} y^{M} = P(x, xy, y)$, (A2) 

where 

$P(x,y,z) = \sum_{H,I,J} P_{H,I,J}\,x^{H} y^{I} z^{J}$, (A3) 

is the generating function corresponding to the three cells. Alternatively, the path integral formalism 
of SSI (see their eq. [5.5]) gives the same result with the following special source (see also BS) 

$J^*(x) = W_{1\setminus 2,R}(x)\,z_1 + W_{1\cap 2,R}(x)\,z_1 z_2 + W_{2\setminus 1,R}(x)\,z_2$, 
$W_V(x) = 1$ if $x \in V$, and 0 otherwise, (A4) 

where the $W$'s are the characteristic functions of the intersection and the set theoretical differences of 
the two cells. 

The generalization of the above formulae for $N$-variate generating functions with possible overlaps 
is a trivial, although tedious exercise. 
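The substitution rule $P(x,y) = P(x, xy, y)$ can be checked numerically. The sketch below is an illustration only: it assumes three independent Poisson counts for the non-overlapping pieces (a toy choice, not the clustered model of the paper) and that the overlap count contributes to both original cells. The function names and the mean counts are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(mu, nmax):
    """Poisson probabilities P(n) for n = 0 .. nmax-1."""
    return [exp(-mu) * mu ** n / factorial(n) for n in range(nmax)]

def trivariate_pgf(x, y, z, mus):
    """P(x, y, z) for three independent Poisson cells with means mus."""
    a, b, c = mus
    return exp(a * (x - 1) + b * (y - 1) + c * (z - 1))

def bivariate_pgf_by_sum(x, y, mus, nmax=40):
    """Direct sum over (H, I, J), with the overlap count I in both cells."""
    pa, pb, pc = (poisson_pmf(m, nmax) for m in mus)
    total = 0.0
    for h in range(nmax):
        for i in range(nmax):
            for j in range(nmax):
                total += pa[h] * pb[i] * pc[j] * x ** (h + i) * y ** (i + j)
    return total

mus = (0.7, 0.3, 0.5)       # mean counts: cell 1 only, overlap, cell 2 only
x, y = 0.6, 0.8
direct = bivariate_pgf_by_sum(x, y, mus)
closed = trivariate_pgf(x, x * y, y, mus)
```

The two numbers agree to within the truncation of the sums, because each object in the overlap region carries one factor of $x$ and one factor of $y$.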



B. Contribution of the Error from Overlapping Cells 



In this Appendix, we improve the approximation of equation (47) in three dimensions ($D = 3$). 
For simplicity, we assume that the survey is spherical of radius $R$ and that the origin of the 
coordinates is the center of the survey. The possible positions of cells of radius $\ell$ contained in the 
survey are thus $r \le \tilde{R} = R - \ell$. Introducing 



and using spherical coordinates, we have 

(B6) 



This two-dimensional integral could be easily evaluated numerically, in case more than leading 
order accuracy in v/V is needed to estimate the errors, particularly edge effects. 

For the sake of comparison, we compute explicitly the dispersion on the average count using the 
above (more accurate) expression. 



^ overlap ^ ^ overlap Ox Oy ^ ^ ^ ^ ^ ^'^ ^ 
(B7) 



where we dropped the obvious "overlap" index on the right hand side. We consider the particular 
case $V > 27v$ (V' > 2) and $\gamma = 3/2$. The result is 

(B8) 



This equation is exact for a Poisson sample if we set $\bar{\xi} = 0$. The approximation of equation (47) 
leads to 

$(\Delta \bar{N})^2_{\rm overlap} \simeq \frac{v}{V}\left[\bar{N} + 8\bar{N}^2 + 5.860\,\bar{N}^2\bar{\xi}\right]$, (B9) 

showing that it is indeed correct up to the leading order in $v/V$. Note that for $v/V = 1/27$, 
equation (B8) gives 

$(\Delta \bar{N})^2_{\rm overlap} = 0.078\,\bar{N} + 0.469\,\bar{N}^2 + 0.348\,\bar{N}^2\bar{\xi}$, (B10) 

while under the same assumption (B9) yields 

$(\Delta \bar{N})^2_{\rm overlap} \simeq 0.037\,\bar{N} + 0.296\,\bar{N}^2 + 0.217\,\bar{N}^2\bar{\xi}$. (B11) 

The difference is less than a factor of 2, showing that the leading order in v/V is still a reasonable 
approximation. 
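The numbers quoted in (B10) and (B11) can be cross-checked with simple arithmetic. The sketch below assumes that the leading-order expression (B9) has the form $(v/V)\,[\bar{N} + 8\bar{N}^2 + 5.860\,\bar{N}^2\bar{\xi}]$, a reading chosen because it reproduces the coefficients of (B11) at $v/V = 1/27$; the variable names are hypothetical.

```python
# Coefficients of Nbar, Nbar**2 and Nbar**2 * xibar.
b9_bracket = (1.0, 8.0, 5.860)       # assumed bracket of (B9), before the v/V prefactor
b10 = (0.078, 0.469, 0.348)          # more accurate expression at v/V = 1/27
b11 = (0.037, 0.296, 0.217)          # leading-order expression at v/V = 1/27

v_over_V = 1.0 / 27.0
leading = tuple(c * v_over_V for c in b9_bracket)

# Term-by-term ratio of the more accurate result to the leading-order one.
ratios = tuple(a / b for a, b in zip(b10, leading))

for name, lead, r in zip(("Nbar", "Nbar^2", "Nbar^2 xibar"), leading, ratios):
    print(f"{name}: leading order {lead:.3f}, ratio (B10)/(B11) = {r:.2f}")
```

With these inputs the two quadratic terms differ by a factor of about 1.6 and the linear term by about 2.1, so the total differs by roughly a factor of two at most, depending on $\bar{N}$ and $\bar{\xi}$.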